Bridging the gap between coding and programming

Adam Loy

Carleton College

ICSA 2022 Applied Statistics Symposium

Students need to be able to program for and with data, not simply write data-driven code

Students need to understand computational principles, not simply write scripts for data analysis

Coding

Programming

We’ve been teaching coding skills

  • Data wrangling

  • Rendering graphics

  • Conditional execution

  • Iteration

  • Writing functions

  • Web scraping

Students need programming skills

  • Developing a logical solution to a problem

  • Planning a large/complex software project

  • Modularization

  • Profiling

  • Unit testing

Each has a place in the curriculum

Coding

Focus: statistical content

  • Data acquisition

  • Data exploration

  • Inference

  • Communication

Programming

Focus: problem solving and technical skills acquisition

  • Data science

  • Computational statistics

  • Capstones

All statistics majors should be exposed to this set of ideas within the curriculum

Supporting programming in the curriculum

Course modules

  • We could design whole courses around this idea, but that requires more resources and curricular redesign

  • More realistic to add a module/topic/assignment exploring one skill

  • Modules can be added to courses with less “disruption”

Example: adding modularization

Modularization

  • the process of identifying and creating a set of reliable helper functions to support a complex computational goal

  • helps students think about the solution before they attempt to code it

  • makes code easier to read and debug

  • robust – unit tests are easier to write
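To make that last point concrete: once a task lives in its own small function, a few assertions pin down its behavior. A minimal sketch, assuming a `resample()` helper like the one developed later in this talk:

```r
# Hypothetical helper: draw B bootstrap means from obs_sample
resample <- function(obs_sample, B) {
  stats <- numeric(B)
  for (i in seq_len(B)) {
    stats[i] <- mean(sample(obs_sample, replace = TRUE))
  }
  stats
}

# Unit tests are easy to state for a single-purpose helper
set.seed(1)
boot_means <- resample(c(2, 4, 6, 8), B = 500)
stopifnot(
  length(boot_means) == 500,              # one statistic per resample
  all(boot_means >= 2 & boot_means <= 8)  # a mean of resampled values stays in range
)
```

Testing the monolithic script version of the same computation would require re-running (and visually inspecting) the whole analysis.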

Traditional or Reactive programming?

“Traditional” module

Who? Upper-division statistics students; intro data science students

What? Modularize familiar code (e.g., recipes from other courses)

When? In a statistics computing course, capstone, or data science course

Where? Either in class or as homework

How? Start with a familiar implementation, and provide a scaffolded path to create a modularized function

Implementing the one-sample bootstrap

# obs_sample is the observed data vector; ggplot2 is used below for the histogram
library(ggplot2)

B <- 1000
stats <- numeric(B)

for(i in 1:B) {
    x <- sample(obs_sample, replace = TRUE)
    stats[i] <- mean(x)
}

ggplot(data = NULL, aes(x = stats)) +
    geom_histogram()

data.frame(
    observed = mean(obs_sample),
    se = sd(stats),
    mean = mean(stats),
    bias = mean(stats) - mean(obs_sample)
)
  1. Carefully read through the code, noting what each line of code does.

  2. Create a list of small/simple tasks that are implemented in the above code chunk.

  3. Discuss your answer to #2 with your group. Settle on a list of the small/simple tasks that are implemented and record it in the shared Google Doc.

  4. Are any of the simple tasks already implemented in an R function? If so, which ones?

  5. Write an R function for each simple task identified. These functions are your set of helper functions.

resample <- function(obs_sample, B) {
  stats <- numeric(B)
  for(i in 1:B) {
    x <- sample(obs_sample, replace = TRUE)
    stats[i] <- mean(x)
  }
  stats
}

plot_hist <- function(stats) {
  ggplot(data = NULL, aes(x = stats)) +
    geom_histogram()
}

calc_summary <- function(obs_sample, stats) {
  data.frame(
    observed = mean(obs_sample),
    se = sd(stats),
    mean = mean(stats),
    bias = mean(stats) - mean(obs_sample)
  )
}
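An optional aside once the loop is isolated in `resample()`: base R's `replicate()` expresses the same computation more compactly. This `resample2()` is an equivalent alternative added here for discussion, not part of the scaffolded exercise:

```r
# Equivalent to the loop-based resample(), using replicate()
resample2 <- function(obs_sample, B) {
  replicate(B, mean(sample(obs_sample, replace = TRUE)))
}

set.seed(1)
stats <- resample2(c(1, 3, 5, 7, 9), B = 200)
stopifnot(length(stats) == 200)  # one bootstrap mean per replication
```

Whether to show this depends on the course; the explicit loop makes the resampling mechanics more visible for students seeing the bootstrap for the first time.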

Using your set of helper functions, create a bootstrap() function that takes

  • obs_sample = observed data vector
  • B = # of resamples

as inputs, prints the plot, and returns the bootstrap summary data frame.

Sample solution:

bootstrap <- function(obs_sample, B) {
  boot_stats <- resample(obs_sample, B)
  print(plot_hist(boot_stats))
  calc_summary(obs_sample, boot_stats)
}

Test your function

# Data set for testing (dplyr provides %>%, filter(), and pull())
library(dplyr)
data(Verizon, package = "resample")
clec <- Verizon %>% filter(Group == "CLEC") %>% pull(Time)

# How do your results compare to the below?
Summary Statistics:
     Observed      SE     Mean        Bias
mean 16.50913 3.94825 16.42974 -0.07939087
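A call like the following runs the check. It assumes the helpers and `bootstrap()` defined above are in your session and that the resample package is installed; the seed is an arbitrary choice of mine, so your numbers will be close to, but not exactly, the output shown:

```r
library(dplyr)    # %>%, filter(), pull()
library(ggplot2)  # plot_hist() builds a ggplot

# Assumes resample(), plot_hist(), calc_summary(), and bootstrap() are defined
data(Verizon, package = "resample")
clec <- Verizon %>% filter(Group == "CLEC") %>% pull(Time)

set.seed(47)  # arbitrary seed, for reproducibility only
bootstrap(clec, B = 1000)
```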

Here’s a non-modularized version of this function. Compare and contrast its readability with that of your modularized version.

bootstrap_nonmod <- function(obs_sample, B) {
  stats <- numeric(B)

  for(i in 1:B) {
    x <- sample(obs_sample, replace = TRUE)
    stats[i] <- mean(x)
  }
  
  p <- ggplot(data = NULL, aes(x = stats)) +
    geom_histogram()
  print(p)
  
  data.frame(
    observed = mean(obs_sample),
    se = sd(stats),
    mean = mean(stats),
    bias = mean(stats) - mean(obs_sample)
  )
}

Shiny-based approach

Who? Intro data science students

What? Streamline interactivity in a Shiny app

When? In a data science course

Where? Either in class or as homework

How? Start with a clunky Shiny app, provide guided exploration to discover and fix issues

Building a one-sample bootstrap app

Exploration questions

  1. What output changes when you change…
    • the number of bootstrap resamples?
    • the number of histogram bins?
    • the confidence level?
  2. Looking back through #1, are there any surprises? Does anything change/update unexpectedly? Does anything fail to update?

  3. Read through the code in the renderPlot() and renderPrint() reactive expressions. Note the key tasks executed within each.

  4. Is there any replication? That is, are any key tasks executed multiple times?

h3("Bootstrap distribution")
renderPlot({
  boot <- resample(obs_data, B = input$n_boot)
  data.frame(means = boot) %>%
    ggplot(aes(x = means)) +
    stat_histinterval(breaks = input$nbins, .width = input$conf_level, size = 20) +
    theme_minimal() +
    labs(x = paste("Mean of flipper length")) 
})

h3("Bootstrap percentile interval")
renderPrint({
  boot <- resample(obs_data, B = input$n_boot)
  alpha <- 1 - input$conf_level
  quantiles <- quantile(boot, probs = c(alpha/2, 1 - alpha/2))
  c(mean = mean(boot), quantiles)
})
  5. Create a list of reactive values/expressions that should be created outside the render* statements. Are any of these user inputs?
  • conf_level (user input)
  • nbins (user input)
  • n_boot (user input)
  • boot
  6. For each reactive expression you identify in step #5, which output (the histogram or the confidence interval) should observe that expression and update when it changes?

    • n_boot → both
    • conf_level → both (the interval shown on the histogram and the printed interval)
    • nbins → histogram only
  7. Use your answers to steps #5 & 6 to reorganize the server-side code for this Shiny app.

    • Create a reactive expression, boot(), to store the bootstrap statistics
boot <- reactive({
  resample(obs_data, B = input$n_boot)
})


  • Adapt the renderPlot() and renderPrint() statements to use the reactive object you just created.
boot <- reactive({
  resample(obs_data, B = input$n_boot)
})

renderPlot({
  data.frame(means = boot()) %>%
    ggplot(aes(x = means)) +
    stat_histinterval(breaks = input$nbins, .width = input$conf_level, size = 20) +
    theme_minimal() +
    labs(x = paste("Mean of flipper length"))
})

renderPrint({
  alpha <- 1 - input$conf_level
  quantiles <- quantile(boot(), probs = c(alpha/2, 1 - alpha/2))
  c(mean = mean(boot()), quantiles)
})

Summary

  • Statistics majors need to be able to program for and with data, not simply write data-driven code

  • For upper-division students, this can be done in the traditional curriculum

  • Reactive programming in Shiny is another way to hone these skills

    • Can come earlier in the curriculum (intro data science)
    • Students can see the benefit in the app/dashboard